Extracting the information backbone in online system

Information overload is a serious problem in modern society, and
many solutions, such as recommender systems, have been proposed to
filter out irrelevant information. In the literature, researchers
have mainly been dedicated to improving the recommendation
performance (accuracy and diversity) of the algorithms, while
overlooking the influence of the topology of the online user-object
bipartite networks. In this paper, we find that some information
provided by the bipartite networks is not only redundant but also
misleading. Exploiting this "less can be more" feature, we design
algorithms that improve the recommendation performance by
eliminating some links from the original networks. Moreover, we
propose a hybrid method combining the time-aware and topology-aware
link removal algorithms to extract the backbone, which contains the
essential information for the recommender systems. From a practical
point of view, our method can improve the performance and reduce the
computational time of the recommender system, thus improving both
its effectiveness and efficiency.

I. INTRODUCTION



Nowadays, we face an overwhelming amount of information from
online systems. We have to choose from thousands of movies,
millions of books, billions of web pages, and so on. This abundance
makes it impossible to go through every candidate product to select
the most suitable one. To address this problem, many recommendation
algorithms have been proposed [1]. These recommender systems analyze
the purchase history of each user and return a small number of the
most relevant products for him/her. Examples include the
popularity-based (PR) method, the collaborative filtering (CF)
method, the mass diffusion (MD) method [1], the heat conduction (HC)
method [?], the hybrid method of mass diffusion and heat conduction
[?] and so on.

Online commercial systems can be represented by user-object
bipartite networks. Recommendation algorithms usually make use of
the whole network, and the recommendation list is generated by
analyzing all the items bought by the target user [?]. When the
recommendation accuracy is low in some specific online systems,
researchers usually attribute it to data sparsity [1]. It is widely
believed that the recommendation performance is strongly related to
the amount of data. However, this common belief might not hold in
reality. For instance, when a user bought some items a long time
ago, these items may not correctly reflect the current taste of this
user. Furthermore, there are always some very popular items that are
collected by almost every user (e.g. some extremely popular movies
watched by nearly everyone). If a user bought such an item, the
recommender system cannot extract much information about the user's
preference from this purchase. Therefore, some links in the online
user-object bipartite networks can be redundant or even misleading.
Appropriately eliminating some connections from the networks might
further improve the network function (in our case, the
recommendation performance). Actually, this "less can be more"
phenomenon has already been found in many dynamical processes. The
most well-known example is the synchronization process, in which
the synchronizability can be enhanced by removing links [?], [?].

The "less can be more" feature indicates that there might be
backbone structures in the original networks. Generally, a backbone
should preserve the topological properties or the function of the
original networks. For example, the degree distribution,
betweenness [12], synchronizability [?], [?] and transportation
ability [15] can be preserved. In online systems, we propose the
concept of an information backbone, which is supposed to preserve
the essential information needed for recommendation. By using only
the information in the backbone structure, the recommender system
is able to predict the items users are interested in as accurately
as with the original networks.

In this paper, we consider two main categories of link removal
processes: time-aware and topology-aware algorithms. We find that
both types of algorithms can remove links without significantly
harming the recommendation performance. Generally, the time-aware
algorithms work better in preserving recommendation accuracy, while
the topology-aware algorithms have an advantage in enhancing the
recommendation diversity. We then combine these two types of
algorithms into a hybrid and achieve a further improvement in
preserving the information for recommendation. By using the hybrid
algorithm, we obtain the above-mentioned information backbone from
the real user-object bipartite networks (the number of links is
reduced by about 80%). Moreover, the structural properties of the
information backbone are analyzed in detail. Finally, we remark
that our method is meaningful from a practical point of view, since
it can largely reduce the computational cost of recommender
systems.



II. METHODS AND MATERIALS 
A. Data Description 

We adopted two standard datasets with time information: Netflix [?]
and Movielens [?]. The Netflix data was sampled from the large
dataset provided for the Netflix Prize. The data spans Feb. 2001 to
May 2001, with 8,609 users and 5,081 items. We use the links during
the first 3 months as the training set and denote it as Et. Among
the remaining links, we randomly select some as the probe set,
denoted as Ep. Since the size of Ep cannot be too large compared to
Et, we set Ep/Et ≈ 10% in this paper. The training set is treated as
known information, while the probe set is used for testing and no
information in this set is allowed to be used for recommendation.
The training set Et of Movielens was sampled from the data collected
from unix time 912578016 to 1058210533, i.e. from 2 Dec. 1998 to 15
Jul. 2003. It consists of 5,547 users and 5,850 items. After the
unix time 1043723983, the remaining 69,805 links are chosen for the
probe set Ep (Ep/Et ≈ 10%). Note that in order to avoid the
cold-start problem, we remove all the new users (who rated no items
in the training set) and new items (which are not rated by any user
in the training set) from the above two probe sets. The simulation
was also carried out on other subsets of the Netflix and Movielens
data and the results are robust, so we only show the results for
the above two subsets.

These online commercial systems can be well described by
user-object bipartite networks [18]. If a user collects an item, a
link is drawn between them. Specifically, we consider a system of N
users and M items represented by a bipartite network with adjacency
matrix A, where the element $a_{i\alpha} = 1$ if user i has
collected object α, and $a_{i\alpha} = 0$ otherwise (throughout
this paper we use Greek and Latin letters, respectively, for
object- and user-related indices). The aim of the recommender
system is to predict which item is most favored by each user, i.e.
which element in A is going to change from 0 to 1 in the future.
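
As a minimal sketch of this representation, the adjacency matrix can be built from a list of (user, item) pairs. The toy link list and the helper name `build_adjacency` are illustrative assumptions and are not taken from the datasets used in the paper.

```python
import numpy as np

def build_adjacency(links, n_users, n_items):
    """Return the N x M 0/1 matrix A with A[i, alpha] = 1 if user i collected item alpha."""
    A = np.zeros((n_users, n_items), dtype=int)
    for i, alpha in links:
        A[i, alpha] = 1
    return A

# Hypothetical toy data: 3 users (rows) and 4 items (columns).
toy_links = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 3)]
A = build_adjacency(toy_links, n_users=3, n_items=4)

user_degree = A.sum(axis=1)  # k_i: number of items collected by each user
item_degree = A.sum(axis=0)  # k_alpha: number of users who collected each item
```

Row sums and column sums of A give the user and item degrees used by the topology-aware algorithms below.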



B. Link removal algorithms 

In order to examine whether there is redundant (or even misleading)
information in the online user-object bipartite networks, we
consider two categories of link removal algorithms: time-aware and
topology-aware algorithms.

Time-aware algorithms use the time information to assign a score to
each pair of connected nodes. This score is directly taken as their
relevance, with the underlying assumption that a relevant
connection is likely to be part of the information backbone for
recommendation. Here are some typical algorithms:

(1) System oldest removal (SOR): The link that appeared earliest
among all the remaining links is removed.

(2) System newest removal (SNR): The link that appeared latest
among all the remaining links is removed.

(3) Individual oldest removal (IOR): The oldest link of each target
user is removed.

(4) Individual newest removal (INR): The newest link of each target
user is removed.

Topology-aware algorithms use the network structure to compute the
relevance of each link iα. We again consider four typical
algorithms:

(5) Most popular removal (MPR): The popularity of a link iα is
defined as $k_i k_\alpha$, where $k_i$ ($k_\alpha$) is the degree
of user i (item α). We calculate the popularity of all the
remaining links and remove the most popular links.

(6) Least popular removal (LPR): The least popular links are
removed.

(7) Most rectangles removal (MRR): A rectangle is defined as a
subgraph consisting of four links from two users to two items. We
calculate the number of rectangles that each link belongs to, then
we remove the link that belongs to the most rectangles.

(8) Fewest rectangles removal (FRR): We remove the link that
belongs to the fewest rectangles.

Finally, we consider a benchmark algorithm for comparison.

(9) Random removal (RR): A link is chosen at random and removed.
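
The two topology-aware scores can be sketched as follows. This is only an illustration on a toy matrix; the function names `link_popularity` and `link_rectangles` and the vectorized way of counting rectangles are assumptions made here, not the authors' implementation.

```python
import numpy as np

def link_popularity(A):
    """Popularity k_i * k_alpha of every existing link (0 where there is no link)."""
    k_user = A.sum(axis=1, keepdims=True)
    k_item = A.sum(axis=0, keepdims=True)
    return (k_user * k_item) * A

def link_rectangles(A):
    """Number of rectangles (two users x two items) that contain each existing link."""
    common = A @ A.T                 # common[i, j] = number of items shared by users i and j
    np.fill_diagonal(common, 0)      # a rectangle needs a second, different user
    k_item = A.sum(axis=0, keepdims=True)
    # For link (i, alpha): sum over the other collectors j of alpha of (common[i, j] - 1),
    # i.e. the items other than alpha that i and j share.
    return (common @ A - (k_item - 1)) * A

# Toy network: users 0 and 1 both collect items 0 and 1; user 2 collects items 1 and 2.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]])
pop = link_popularity(A)
rect = link_rectangles(A)
```

Under MPR the link with the largest entry of `pop` would be removed first, and under MRR the link with the largest entry of `rect`; LPR and FRR use the smallest non-zero-link entries instead.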

In order to make all the algorithms comparable, all links are
removed in 50 macro-steps. Therefore, around 2 percent of the links
are chosen in each macro-step. For example, if there are 90 links
in the original network, on average 90/50 = 1.8 links should be
removed in each macro-step. After the nth macro-step, [1.8n] links
will have been removed from the network. In the IOR and INR
algorithms, the number of links to be removed for each user in each
macro-step is proportional to his/her degree.
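
The removal schedule of the 90-link example can be sketched as below. The text does not say whether the bracket in [1.8n] rounds down or to the nearest integer; rounding down is an assumption of this sketch, as is the helper name `cumulative_removed`.

```python
def cumulative_removed(n_links, step, n_steps=50):
    """Links removed after `step` macro-steps, assuming [x] means rounding down."""
    return (n_links * step) // n_steps

# Number of links removed in each of the 50 macro-steps for a 90-link network.
per_step = [cumulative_removed(90, s) - cumulative_removed(90, s - 1)
            for s in range(1, 51)]
```

With this convention every macro-step removes either 1 or 2 links, and all 90 links are gone after the 50th macro-step.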
C. Recommender system

In this paper, we employ the well-known user-based collaborative
filtering (UCF) as the standard recommender system [?,11]. In UCF,
the recommendation score of an item is evaluated by the similarity
$s_{ij}$ between the target user and the users who collected the
item. Actually, the similarity between two nodes in a network can
be defined in various ways. In this paper, we use the Salton index
[?] to calculate the similarity between users. For a node i, let
$\Gamma_i$ denote the set of neighbours of i; the Salton index is
written as

$$s_{ij} = \frac{|\Gamma_i \cap \Gamma_j|}{\sqrt{k_i \times k_j}} \qquad (2)$$

where $k_i = |\Gamma_i|$ denotes the degree of i. The Salton index
is also called the cosine similarity in some of the literature [1].
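
A minimal sketch of the Salton index and of a UCF score is given below. The score used here, the sum of the similarities between the target user and the users who collected the item, is a common form of UCF and is an assumption of this sketch; the function names are illustrative as well.

```python
import numpy as np

def salton_similarity(A):
    """Salton (cosine) index between users: shared items / sqrt(k_i * k_j)."""
    common = (A @ A.T).astype(float)
    k = A.sum(axis=1).astype(float)
    denom = np.sqrt(np.outer(k, k))
    S = np.divide(common, denom, out=np.zeros_like(common), where=denom > 0)
    np.fill_diagonal(S, 0.0)  # a user is not counted as their own neighbour
    return S

def ucf_scores(A):
    """Assumed UCF score: sum of similarities to the users who collected each item."""
    scores = salton_similarity(A) @ A
    return np.where(A == 1, -np.inf, scores)  # never recommend already collected items

# Toy network: users 0 and 1 both collect items 0 and 1; user 2 collects items 1 and 2.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]])
S = salton_similarity(A)
F = ucf_scores(A)
```

Sorting each row of `F` in descending order gives the recommendation list of the corresponding user.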

In this paper, we use several standard metrics to evaluate the
recommendation results [1]. The first one is the area under the
receiver operating characteristic (ROC) curve, which is used to
quantify the accuracy of recommendation [?]. In the present case,
this metric can be interpreted as the probability that a randomly
chosen item in user i's probe set is given a higher score than a
randomly chosen item that i has rated neither in the training set
nor in the probe set. In the implementation, among n independent
comparisons, if there are n' times in which the probe-set item has
a higher score than the unrated item and n'' times in which they
have the same score, the accuracy is defined as

$$\mathrm{AUC}_i = \frac{n' + 0.5\,n''}{n} \qquad (3)$$

Since real users usually consider only the top part of the
recommendation list, a more practical measure is the number of user
i's probe-set links contained in the top L places (we set L = 20 in
this paper). This measure is usually referred to as the precision
[19] of the recommender system, and the top-L precision is defined
as

$$P_i(L) = \frac{R_i(L)}{L} \qquad (4)$$

where $R_i(L)$ indicates the number of relevant objects (namely the
objects collected by i in the probe set) in the top-L places of the
recommendation list.

Averaging over all the users, we obtain the accuracy and precision
of the whole system, as
$\mathrm{AUC} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AUC}_i$ and
$P(L) = \frac{1}{N}\sum_{i=1}^{N}P_i(L)$.
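
The two per-user metrics can be sketched as follows. The sampled AUC follows the comparison procedure described above, counting ties as half; the dictionary of scores, the number of comparisons and the function names are assumptions made for illustration.

```python
import random

def auc_user(scores, probe_items, uncollected_items, n=1000, seed=0):
    """Sampled per-user AUC: (wins + 0.5 * ties) / n over n random comparisons."""
    rng = random.Random(seed)
    probe, others = list(probe_items), list(uncollected_items)
    higher = equal = 0
    for _ in range(n):
        p = scores[rng.choice(probe)]   # score of a random probe-set item
        u = scores[rng.choice(others)]  # score of a random item the user never rated
        if p > u:
            higher += 1
        elif p == u:
            equal += 1
    return (higher + 0.5 * equal) / n

def precision_user(scores, probe_items, candidates, L=20):
    """Fraction of the top-L recommended candidates that appear in the probe set."""
    top = sorted(candidates, key=lambda a: scores[a], reverse=True)[:L]
    return len(set(top) & set(probe_items)) / L

# Hypothetical scores of four items for one user.
scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.5}
auc_clear = auc_user(scores, probe_items=[0], uncollected_items=[1])
auc_tie = auc_user(scores, probe_items=[2], uncollected_items=[3])
prec = precision_user(scores, probe_items=[0], candidates=[0, 1, 2, 3], L=2)
```

The system-level values are then the averages of these per-user quantities over all users.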

Diversity is also an important aspect of a recommender system [?].
Here we adopt the inter-user diversity, which is defined by
considering the uniqueness of different users' recommendation
lists. Given two users i and j, the difference between their
recommendation lists can be measured by the Hamming distance,

$$H_{ij}(L) = 1 - \frac{Q_{ij}(L)}{L} \qquad (5)$$

where $Q_{ij}(L)$ is the number of common objects in the top-L
places of both lists. Clearly, if users i and j have the same list,
$H_{ij}(L) = 0$, while if their lists are completely different,
$H_{ij}(L) = 1$. Averaging $H_{ij}(L)$ over all pairs of users, we
obtain the mean distance H(L).
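
A short sketch of the inter-user diversity, using the property stated above that identical lists give a distance of 0 and disjoint lists give 1. The function names and the toy recommendation lists are illustrative assumptions.

```python
from itertools import combinations

def hamming_distance(list_i, list_j, L):
    """1 minus the fraction of common objects in the top-L places of two lists."""
    q = len(set(list_i[:L]) & set(list_j[:L]))
    return 1.0 - q / L

def mean_diversity(rec_lists, L):
    """Average Hamming distance over all pairs of users' recommendation lists."""
    pairs = list(combinations(rec_lists, 2))
    return sum(hamming_distance(a, b, L) for a, b in pairs) / len(pairs)

# Hypothetical top-2 lists of three users.
H = mean_diversity([[1, 2], [1, 3], [4, 5]], L=2)
```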



D. Structure indices 

After removing links, we compare the structural features of the
obtained network and the original network. The first one is the
clustering coefficient [20], which is defined as the ratio between
the number of rectangles and the total number of possible
rectangles. For a given node i, its clustering coefficient reads

$$C_4(i) = \frac{\sum_{m=1}^{k_i}\sum_{n=m+1}^{k_i} q_i(m,n)}{\sum_{m=1}^{k_i}\sum_{n=m+1}^{k_i}\left[a_i(m,n) + q_i(m,n)\right]} \qquad (6)$$

where m and n label neighbors of node i, $q_i(m,n)$ is the number
of common neighbors between m and n, and
$a_i(m,n) = (k_m - \eta_i(m,n))(k_n - \eta_i(m,n))$ with
$\eta_i(m,n) = 1 + q_i(m,n)$. Here we calculate the average
clustering coefficient of users, items and the whole network,
respectively. Note that since nodes whose degrees are below 2
cannot form any rectangle, we do not take these nodes into account
when we calculate the clustering coefficient.

Secondly, we consider the assortativity coefficient [21], which is
the Pearson correlation coefficient of the degrees of pairs of
linked nodes.
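
A common closed form of this coefficient is Newman's link-based expression, given here as a sketch. The symbols $j_e$ and $k_e$, the degrees of the two nodes joined by link $e$, are notation assumed for illustration:

```latex
r = \frac{|E|^{-1}\sum_{e \in E} j_e k_e
          - \left[ |E|^{-1}\sum_{e \in E} \tfrac{1}{2}\,(j_e + k_e) \right]^2}
         {|E|^{-1}\sum_{e \in E} \tfrac{1}{2}\,(j_e^2 + k_e^2)
          - \left[ |E|^{-1}\sum_{e \in E} \tfrac{1}{2}\,(j_e + k_e) \right]^2}
```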
where |E| is the number of links in the network. Another related
index is the degree heterogeneity, calculated on both the user side
and the item side through
$H = \langle k^2 \rangle / \langle k \rangle^2$.

We also consider the 3-step diffusion range (DR). It is strongly
related to the recommendation process, since many recent
recommendation algorithms are based on the diffusion process [?].
For a given node i, the 3-step diffusion range is simply the
fraction of nodes covered if the diffusion starts from node i and
propagates for 3 steps. The 3-step diffusion range of a network is
the average value over all nodes.
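
The degree heterogeneity and the diffusion range can be sketched as follows. The text does not specify which nodes enter the fraction; this sketch assumes the diffusion starts from a user, counts the start node as covered, and normalizes by all N + M nodes. The function names are illustrative.

```python
import numpy as np

def degree_heterogeneity(degrees):
    """Mean squared degree divided by the squared mean degree."""
    k = np.asarray(degrees, dtype=float)
    return (k ** 2).mean() / k.mean() ** 2

def diffusion_range(A, start_user, steps=3):
    """Fraction of all users and items reached within `steps` hops from one user."""
    users = np.zeros(A.shape[0], dtype=bool)
    items = np.zeros(A.shape[1], dtype=bool)
    users[start_user] = True
    for step in range(steps):
        if step % 2 == 0:   # spread from the reached users to their items
            items |= A[users].any(axis=0)
        else:               # spread from the reached items back to their users
            users |= A[:, items].any(axis=1)
    return (users.sum() + items.sum()) / (A.shape[0] + A.shape[1])

# Toy network: 3 users, 4 items; user 2 is disconnected from users 0 and 1.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])
dr = diffusion_range(A, start_user=0)
het = degree_heterogeneity([1, 1, 4])
```

Averaging `diffusion_range` over all start nodes would give the network-level value described above.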
III. RESULTS

A. "Less can be more" phenomenon in online systems

It is usually believed that the more historical information we
gather, the more accurate the prediction can be. However, this
common belief is not always true, especially in recommender
systems. In order to examine whether there is redundant (or even
misleading) information in the online user-object bipartite
networks, we adopted two standard datasets with time information:
Netflix and Movielens. We first recall that our main objective is
to investigate how much information is needed to correctly predict
the links in the probe set, and which link removal algorithm is
most effective in extracting the essential information from the
training set. In our simulation, we remove links from the training
set step by step according to the different algorithms (see
Subsection II B "Link removal algorithms"). After each macro-step,
we monitor the change of the recommendation performance, namely the
recommendation accuracy, precision and diversity (see Subsection
II C "Recommender system"). Note that as the macro-step increases,
the number of links in the training set gradually decreases while
the size of the probe set is kept unchanged.

The results for the time-aware algorithms are reported in Fig. 1
(note that only the most relevant results are plotted here for the
sake of clear presentation; the comprehensive comparison is shown
in Fig. S1 in Appendix S1). Interestingly, instead of decreasing,
the AUC and P(L) can increase as links are removed from the network
by some algorithms. Overall, SOR and IOR perform better among the
time-aware algorithms, while the recommendation accuracies of the
other two, i.e., SNR and INR, decline sharply. Many studies have
revealed that putting less weight on old links can indeed improve
the recommendation performance [?]. This is consistent with SOR and
IOR working well in the link removal process. In our simulation, we
observe that IOR is generally better than SOR. This is because SOR
may remove all the links of some small-degree users, which leads to
a very serious cold-start problem.

The results for the topology-aware algorithms are reported in
Fig. 2 (again, only the most relevant results are plotted for the
sake of clear presentation; the comprehensive comparison is shown
in Fig. S2 in Appendix S1). Among the topology-aware algorithms,
MPR and MRR are more accurate than the others. Previous literature
shows that the recommendation performance is strongly related to
the clustering effect of the networks. More specifically, the more
rectangles the network has, the more accurate the recommendation
can be. In this sense, links with few rectangles would not carry
much information and should be removed first from the network.
However, we show that the MRR algorithm performs far better than
FRR. A similar phenomenon is observed in the algorithms that
consider link popularity. On the item side, the most popular items
are bought by almost all the users. The links connecting to the hub
items cannot reflect the real taste of users. Likewise, a
high-degree user is interested in many different kinds of items. If
an item is collected by such a user, the recommender system cannot
determine the intrinsic property of this item and thus cannot
predict the potential users who might like it. Therefore, the links
with low popularity generally contain more information. Moreover,
the MPR and MRR algorithms not only help the recommender system to
reveal the real taste of users, but also improve the recommendation
diversity (see Fig. 2).

In both Fig. 1 and Fig. 2, we plot the results of random removal
(RR) for comparison. It seems that the recommendation accuracy can
also be well preserved by the RR algorithm. However, unlike the
SOR, IOR, MPR and MRR algorithms, RR cannot improve the AUC and
precision by removing links. Besides, the recommendation diversity
is very low when using the RR algorithm. Since the links of
small-degree users and unpopular items have the same probability of
being removed as the other links, the RR algorithm causes a quite
serious cold-start problem.

The phenomenon above indicates that there is a "less can be more"
feature in online recommender systems. At the beginning, some
redundant and misleading links are deleted, which improves the
recommendation accuracy and precision. As more links are removed,
some information necessary for the recommender system is inevitably
destroyed, and thus both the accuracy and precision decrease in the
final part of the link removal process, as shown in Fig. 1. These
results imply that there is an information backbone in these online
bipartite networks.



B. The information backbone and the related topology properties

By comparing the performances of the different removal algorithms,
we find that both the time-aware algorithms and the topology-aware
algorithms can remove the redundant and misleading information from
the networks. However, each type of method has its own advantage.
The time-aware algorithms work better in preserving recommendation
accuracy, while the topology-aware algorithms have an advantage in
enhancing the recommendation diversity. One straightforward
extension is to combine the two into a hybrid method to better
extract the information backbone from the online bipartite
networks. For simplicity, we choose SOR among the time-aware
algorithms and MPR among the topology-aware algorithms. We use a
tunable parameter λ in the hybrid method to adjust the tendency
towards the SOR algorithm or the MPR algorithm. In practice, a
random number $N_{rand}$ between 0 and 1 is generated before
removing a link. If $N_{rand} > \lambda$, the link is selected
according to SOR; otherwise, it is selected according to MPR.
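
The selection rule of the hybrid method can be sketched as below. The tunable parameter is called `lam` in the code; the dictionaries of timestamps and popularities, the toy links and the function name `hybrid_pick` are assumptions made for illustration.

```python
import random

def hybrid_pick(links, timestamps, popularity, lam, rng):
    """Pick one link to remove: SOR if the random number exceeds lam, otherwise MPR."""
    if rng.random() > lam:
        return min(links, key=lambda e: timestamps[e])  # oldest remaining link (SOR)
    return max(links, key=lambda e: popularity[e])      # most popular remaining link (MPR)

# Hypothetical toy links with a timestamp and a popularity k_i * k_alpha each.
links = [("u1", "a"), ("u1", "b"), ("u2", "a")]
timestamps = {("u1", "a"): 3, ("u1", "b"): 1, ("u2", "a"): 2}
popularity = {("u1", "a"): 4, ("u1", "b"): 2, ("u2", "a"): 2}

pure_topology = hybrid_pick(links, timestamps, popularity, lam=1.0, rng=random.Random(0))
pure_time = hybrid_pick(links, timestamps, popularity, lam=0.0, rng=random.Random(0))
```

Setting `lam` to 1 always selects by popularity, while setting it to 0 selects by age (up to the negligible chance of drawing exactly 0).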
FIG. 1: The variation tendencies of AUC, P(L) and H(L) as the macro-step increases; step-i denotes the ith macro-step.
The results of Netflix are shown in sub-figures (Netflix-1), (Netflix-2) and (Netflix-3), and those
of Movielens are shown in sub-figures (Movielens-1), (Movielens-2) and (Movielens-3). Note that only the best-performing
time-aware algorithms (SOR and IOR) are compared with 'Random removal (RR)' here. A comprehensive comparison among
these time-aware algorithms is shown in Fig. S1 in Appendix S1.




FIG. 2: The variation tendencies of AUC, P(L) and H(L) as the macro-step increases; step-i denotes the ith macro-step.
The results of Netflix are shown in sub-figures (Netflix-1), (Netflix-2) and (Netflix-3), and those
of Movielens are shown in sub-figures (Movielens-1), (Movielens-2) and (Movielens-3). Note that only the best-performing
topology-aware algorithms (MPR and MRR) are compared with 'Random removal (RR)' here. A comprehensive comparison
among these topology-aware algorithms is shown in Fig. S2 in Appendix S1.



The results of this hybrid method are shown in Fig. 3. When λ = 0
(pure time-aware algorithm), although the recommendation accuracy
and precision can stay relatively high even when a lot of links are
removed, the recommendation diversity is not satisfactory. When
λ = 1 (pure topology-aware algorithm), the recommendation diversity
can be very close to the maximum of 1. However, the recommendation
accuracy and precision drop quickly as the links are removed. The
hybrid algorithm is able to keep a reasonable balance between
recommendation diversity and accuracy. Moreover, the hybrid
algorithm can sometimes even outperform the time-aware algorithm in
preserving the recommendation accuracy when a large number of links
are removed from the networks.

With the hybrid method, we further move to extract the information
backbone from the bipartite networks. One immediate question is how
many links should be removed. Here, we use a simple criterion to
determine the optimal number of links to remove. As discussed
above, the backbone should effectively preserve the recommendation
accuracy of the original networks. In the hybrid method, links are
removed until the AUC is lower than 95% of the AUC of the original
networks. We select the λ under which the number of removed links
is the largest. Note that when there are several λ with the same
number of removed links, we select the one with the highest
recommendation diversity. In this way, we obtain the information
backbone of the original networks. In this backbone, the
recommendation performance is preserved and the recommender system
only has to deal with a small number of links (72% and 80% of the
links are removed in Movielens and Netflix, respectively). The
related results can be seen in Table I. They show that the network
resulting from the hybrid algorithm has both high recommendation
accuracy and diversity compared to the pure algorithms.
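
The selection criterion can be sketched as follows. The layout of `results` (one AUC curve, one curve of cumulative removed links and one diversity value per parameter setting), the toy numbers and the function names are all hypothetical and serve only to illustrate the 95% rule and the diversity tie-break.

```python
def removable_links(auc_curve, removed_curve, auc_original, threshold=0.95):
    """Largest number of removed links before the AUC first drops below the threshold."""
    best = 0
    for auc, removed in zip(auc_curve, removed_curve):
        if auc < threshold * auc_original:
            break
        best = removed
    return best

def select_parameter(results, auc_original):
    """Pick the setting that removes the most links; break ties by higher diversity."""
    def key(lam):
        auc_curve, removed_curve, diversity = results[lam]
        return (removable_links(auc_curve, removed_curve, auc_original), diversity)
    return max(results, key=key)

# Hypothetical curves for three parameter values (not data from the paper).
results = {
    0.0: ([0.91, 0.90, 0.84], [10, 20, 30], 0.6),
    0.5: ([0.92, 0.88, 0.86], [10, 20, 30], 0.8),
    1.0: ([0.89, 0.85, 0.80], [10, 20, 30], 0.9),
}
best_lam = select_parameter(results, auc_original=0.90)
```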

Next, we investigate the structural features of the obtained
information backbone. We compare the original networks and the
obtained information backbone in terms of four structure indices:
clustering coefficient, assortativity, degree heterogeneity and
3-step diffusion range (see Subsection II D "Structure indices").
The structural properties of the initial network and of the
networks resulting from the different algorithms can also be seen
in Table I. Clearly, the structural properties of the network from
the hybrid algorithm (which we call the "information backbone") lie
between those of the SOR and MPR algorithms. The clustering
coefficient of the information backbone is inevitably smaller than
that of the original networks, since the clustering coefficient is
strongly related to the link sparsity. For the assortativity, the
information backbone generally has a higher value than the original
networks. As mentioned above, the links to hub items cannot reflect
the real interests of the users, so a lot of links connecting to
hub items and hub users are removed. As a result, the assortativity
is generally larger in the backbone networks, and this also
explains why the degree heterogeneity of the backbone network is
generally smaller. As for the 3-step diffusion range, the
information backbone contains the essential information for the
recommender system. The items reached by 3-step diffusion are
almost all items that the users might be interested in, and the
wrong items are no longer covered by the diffusion. Therefore, the
diffusion range is much smaller than in the original networks.



IV. DISCUSSION 

The rapid expansion of the internet leads to an increasing amount
of information on the World Wide Web. Recommendation algorithms
have thus been proposed to address the problem of information
overload. Previous recommendation algorithms use all the available
information of the online user-object bipartite networks to
generate the recommendation list. We find, however, that some links
in the networks might be redundant and misleading. Therefore, we
proposed a hybrid algorithm combining both the time and topology
information to remove unnecessary links. In this way, we obtained
the information backbone, which contains the essential information
for recommendation.

Nowadays, recommender systems have to deal with a very large amount
of data to generate personalized recommendations for each user. The
backbone extraction method can be regarded as a data pre-treatment.
Before the recommendation is carried out, the amount of data can be
significantly reduced by our method while the recommendation
results stay almost unchanged. In this sense, our method is
meaningful from a practical point of view, since it can largely
reduce the computational cost of recommender systems.

FIG. S1: The variation tendencies of AUC, P(L) and H(L) as the macro-step increases; step-i denotes the ith macro-step.
The results of Netflix are shown in sub-figures (Netflix-1), (Netflix-2) and (Netflix-3), and those
of Movielens are shown in sub-figures (Movielens-1), (Movielens-2) and (Movielens-3). This figure focuses on the time-aware
algorithms.

FIG. S2: The variation tendencies of AUC, P(L) and H(L) as the macro-step increases; step-i denotes the ith macro-step.
The results of Netflix are shown in sub-figures (Netflix-1), (Netflix-2) and (Netflix-3), and those of
Movielens are shown in sub-figures (Movielens-1), (Movielens-2) and (Movielens-3). This figure focuses on the topology-aware
algorithms.



